[Feature] Add cleanup for terminated RayJob/RayCluster metrics #3923


Open
phantom5125 wants to merge 9 commits into master

Conversation


@phantom5125 phantom5125 commented Aug 7, 2025

Why are these changes needed?

Some of our metrics are stored permanently, which can cause the /metrics endpoint to become slow or time out over time, so we need a lifecycle-based cleanup.

Related issue number

Closes #3820

End-to-end test example

$ kubectl apply -f ray-operator/config/samples/ray-job.sample.yaml

# In a separate terminal:
$ kubectl port-forward <kuberay-operator-pod-name> 8080:8080

$ curl -s 127.0.0.1:8080/metrics | grep kuberay_
# HELP kuberay_cluster_condition_provisioned Indicates whether the RayCluster is provisioned
# TYPE kuberay_cluster_condition_provisioned gauge
kuberay_cluster_condition_provisioned{condition="true",name="rayjob-sample-clwvk",namespace="default"} 1
# HELP kuberay_cluster_info Metadata information about RayCluster custom resources
# TYPE kuberay_cluster_info gauge
kuberay_cluster_info{name="rayjob-sample-clwvk",namespace="default",owner_kind="RayJob"} 1
# HELP kuberay_cluster_provisioned_duration_seconds The time, in seconds, when a RayCluster's `RayClusterProvisioned` status transitions from false (or unset) to true
# TYPE kuberay_cluster_provisioned_duration_seconds gauge
kuberay_cluster_provisioned_duration_seconds{name="rayjob-sample-clwvk",namespace="default"} 1259.406597953
...

After the CR is deleted, its metrics are gone as well:

$ kubectl delete rayjob rayjob-sample
$ curl -s 127.0.0.1:8080/metrics | grep kuberay_

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@phantom5125 phantom5125 marked this pull request as draft August 7, 2025 19:02
@troychiu
Contributor

troychiu commented Aug 7, 2025

Hi @phantom5125, thank you for creating the PR. Just want to make sure we are on the same page before you start polishing it. Do we really need the TTL-based cleanup? I was thinking of cleaning up the metrics as soon as the CR is deleted.

@phantom5125
Author

> Hi @phantom5125, thank you for creating the PR. Just want to make sure we are on the same page before you start polishing it. Do we really need the TTL-based cleanup? I was thinking of cleaning up the metrics as soon as the CR is deleted.

Thanks for pointing this out!

From my perspective, the independent metricsTTL is primarily intended to address the scenario where JobTTLSeconds is set to 0. In that case, the RayJob CR is deleted immediately after the job finishes, so metrics like kuberay_job_execution_duration_seconds may never be collected, because they would likely be deleted as soon as they are produced.

@troychiu
Contributor

troychiu commented Aug 9, 2025

> > Hi @phantom5125, thank you for creating the PR. Just want to make sure we are on the same page before you start polishing it. Do we really need the TTL-based cleanup? I was thinking of cleaning up the metrics as soon as the CR is deleted.
>
> Thanks for pointing this out!
>
> From my perspective, the independent metricsTTL is primarily intended to address the scenario where JobTTLSeconds is set to 0. In that case, the RayJob CR is deleted immediately after the job finishes, so metrics like kuberay_job_execution_duration_seconds may never be collected, because they would likely be deleted as soon as they are produced.

I think introducing TTL-based cleanup is overkill for this scenario. Instead, we can simply document that setting JobTTLSeconds to a value smaller than the Prometheus scrape interval may cause metrics to be deleted before Prometheus can collect them. I'd rather start with a simpler implementation. What do you think?

@phantom5125
Author

> I think introducing TTL-based cleanup is overkill for this scenario. Instead, we can simply document that setting JobTTLSeconds to a value smaller than the Prometheus scrape interval may cause metrics to be deleted before Prometheus can collect them. I'd rather start with a simpler implementation. What do you think?

Ok, I will take your suggestion and update the PR soon!

@phantom5125 phantom5125 changed the title [Feature] Add TTL-based cleanup for terminated RayJob/RayCluster metrics [Feature] Add cleanup for terminated RayJob/RayCluster metrics Aug 9, 2025
@phantom5125 phantom5125 marked this pull request as ready for review August 9, 2025 19:56
@phantom5125
Author

@troychiu PTAL, thanks!

Contributor

@troychiu troychiu left a comment


thank you for the contribution!

@phantom5125 phantom5125 requested a review from troychiu August 13, 2025 16:17

// CreateAndExecuteMetricsRequest is a test helper that creates an HTTP GET request to the /metrics endpoint,
// executes it against a Prometheus handler using the provided registry, and returns the request, response recorder, and handler.
func CreateAndExecuteMetricsRequest(t *testing.T, reg *prometheus.Registry) (*http.Request, *httptest.ResponseRecorder, http.Handler) {
Contributor


It's a bit hard for me to understand the usage of this helper function. It not only calls ServeHTTP to send the request but also returns the handler so that it can be used again. For a test case that sends multiple requests, I think it's a bit confusing.

Author


#3923 (comment)
@win5923 What do you think? I can accept either way

Contributor


I think having a helper function makes sense, but its current functionality is a bit confusing. It would be clearer if it either only returned a handler that the caller can reuse, or just sent the request on the caller's behalf. Let me know if you feel confused!

Contributor

@win5923 win5923 Aug 14, 2025


I agree with Troy's point. We can simply send the request and return the response recorder.

WDYT?

// ExecuteMetricsRequest executes a GET request to /metrics and returns the response recorder.
func ExecuteMetricsRequest(t *testing.T, handler http.Handler) *httptest.ResponseRecorder {
    t.Helper()
    req, err := http.NewRequestWithContext(context.Background(), http.MethodGet, "/metrics", nil)
    require.NoError(t, err)

    rr := httptest.NewRecorder()
    handler.ServeHTTP(rr, req)

    return rr
}

Author


@win5923 I just refactored in the latest commit with:

func GetMetricsResponseAndCode(t *testing.T, reg *prometheus.Registry) (string, int) {
	t.Helper()
	req, err := http.NewRequestWithContext(t.Context(), http.MethodGet, "/metrics", nil)
	require.NoError(t, err)

	rr := httptest.NewRecorder()
	handler := promhttp.HandlerFor(reg, promhttp.HandlerOpts{})
	handler.ServeHTTP(rr, req)

	return rr.Body.String(), rr.Code
}

Does it look OK? I found it unnecessary to reuse the handler, since in the test code we only care about the response body and status code. cc @troychiu

Contributor

@win5923 win5923 Aug 14, 2025


Sure, I think this is better.
Let's just wait for Troy's comment. Thanks!

@phantom5125 phantom5125 requested a review from troychiu August 14, 2025 17:20
Successfully merging this pull request may close these issues.

[Feature] Add prometheus metrics reset support